Scraping web data with Python and charting it, part 6: fetching monthly revenue data from ROC year 102 onward, an integration exercise

DAY 6

Using Python to scrape open data from web pages, analyze it, and plot charts: part 6 of the series

Programming Ironman Contest

timloo

2013-09-21 22:58:58


Let's try writing the revenue data published on the web for January through August of ROC year 102 (2013) into sqlite3 in one go.

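This time we use more of Python's string functions: split('：') cuts a string apart at a separator, strip() removes whitespace from both ends of a string, and replace(",", "") swaps one substring for another (here it removes the thousands-separator commas).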
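Here is a minimal sketch of those three string methods on their own. The two sample strings are made up for illustration; they are only shaped like the heading text and the number cells on the revenue page.

```python
# Made-up sample strings, shaped like a heading and a number cell on the page
heading = '產業別：光電業'
cell = '  7,386,695 '

parts = heading.split('：')                   # split on the full-width colon
number_text = cell.strip().replace(',', '')  # trim spaces, then drop the commas

print(parts)               # ['產業別', '光電業']
print(parts[1])            # 光電業
print(number_text)         # 7386695
print(float(number_text))  # 7386695.0
```

Note that the separator is the full-width colon '：', not the ASCII ':'. Splitting on the wrong one would return the whole string as a single item.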

We also use range, a staple of for loops.
Putting the fragments from the previous exercises together:

import urllib.request
from bs4 import BeautifulSoup
import sqlite3
conn = sqlite3.connect('revenue.db')
conn.text_factory = str
c = conn.cursor()


for k in range(1, 9):  # months 1 through 8
    ym = '2013' + '%02d' % k  # year-month key, e.g. '201302'
    url = 'http://mops.twse.com.tw/t21/sii/t21sc03_102_'+str(k)+'_0.html'
    response = urllib.request.urlopen(url)
    html = response.read()
    # the page is cp950 encoded; 'ignore' skips bytes that cannot be decoded
    sp = BeautifulSoup(html.decode('cp950','ignore').encode('utf-8'))

    # each matching outer table holds one industry
    tblh=sp.find_all('table', attrs={ 'border' : '0','width' : '100%' })
    print (ym)
    for h in range(0, len(tblh)):
        th=tblh[h].find('th',attrs={ 'align' : 'left','class' : 'tt' })
        cls=th.get_text().split('：') # industry category; cls[1] is the name
        tbl=tblh[h].find('table', attrs={ 'bordercolor' : "#FF6600" })

        trs=tbl.find_all('tr')
        for r in range(0,len(trs)):
            # skip the first two rows and the last row, which are not company rows
            if r>1 and r<(len(trs)-1):
                tds=trs[r].find_all('td')
                td0=tds[0].get_text()
                td1=tds[1].get_text()
                # numeric cells: trim whitespace and remove thousands separators
                td2=tds[2].get_text().strip().replace(",", "")
                td4=tds[4].get_text().strip().replace(",", "")
                td6=tds[6].get_text().strip()
                td7=tds[7].get_text().strip().replace(",", "")
                td8=tds[8].get_text().strip().replace(",", "")
                td9=tds[9].get_text().strip()

                # cells 3 and 5 are not read; '0' is stored in their place
                rvnlst=(ym,cls[1],td0,td1,td2,'0',td4,'0',td6,td7,td8,td9)
                c.execute('INSERT INTO rvn VALUES (?,?,?,?,?, ?,?,?,?,?, ?,?)', rvnlst)
conn.commit()

An occasional problem:
The February page raised an exception:

k=2
url = 'http://mops.twse.com.tw/t21/sii/t21sc03_102_'+str(k)+'_0.html'
response = urllib.request.urlopen(url)
html = response.read()
sp = BeautifulSoup(html.decode('cp950')) # without 'ignore'

output:
UnicodeDecodeError: 'cp950' codec can't decode bytes in position 295075-295076: illegal multibyte sequence

This error is well known online. Let's look at the help text for the decode function:

B.decode(encoding='utf-8', errors='strict') -> str

Decode B using the codec registered for encoding. Default encoding
is 'utf-8'. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
as well as any other name registerd with codecs.register_error that is
able to handle UnicodeDecodeErrors.

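decode takes two parameters. The second one, errors, defaults to 'strict', which raises UnicodeDecodeError. The other possible values are 'ignore' and 'replace'. In this case the bytes that cannot be decoded sit at positions 295075-295076, too far into the page to tell which odd character is responsible, so I use **'ignore'**.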
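To see the three error modes side by side, here is a small sketch. The byte string is made up: it is '營收' encoded as cp950 with one stray 0xFF byte added, standing in for whatever bad bytes the February page actually contains.

```python
# '營收' encoded as cp950, then one stray byte (0xFF) that cp950 cannot decode
data = '營收'.encode('cp950') + b'\xff' + b'abc'

try:
    data.decode('cp950')  # errors='strict' is the default
    strict_failed = False
except UnicodeDecodeError as err:
    strict_failed = True
    print('strict raised:', err.reason, 'at byte', err.start)

ignored = data.decode('cp950', 'ignore')    # drops the bad byte
replaced = data.decode('cp950', 'replace')  # puts U+FFFD where the bad byte was

print(repr(ignored))   # '營收abc'
print(repr(replaced))  # '營收\ufffdabc'
```

'ignore' silently loses data, so it suits a case like this one where a single odd character in a large page does not matter. When you need to see where the damage is, 'replace' leaves a visible marker.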

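Also note html.decode('cp950','ignore').encode('utf-8'), which differs from the previous exercise.
From what I read online, it changes the text's encoding from 'cp950' to 'utf-8': decode turns the cp950 bytes into a str, and encode turns that str back into bytes, this time in UTF-8.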
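The two steps can be followed on a short sample. This is only a sketch: '光電業' stands in for the page content, and the variable names are mine.

```python
# Stand-in for the raw page bytes: three Chinese characters in cp950
big5_bytes = '光電業'.encode('cp950')

text = big5_bytes.decode('cp950', 'ignore')  # bytes -> str
utf8_bytes = text.encode('utf-8')            # str -> bytes, now UTF-8

print(type(big5_bytes).__name__, len(big5_bytes))  # bytes 6
print(type(text).__name__, len(text))              # str 3
print(type(utf8_bytes).__name__, len(utf8_bytes))  # bytes 9
```

The same three characters take 6 bytes in cp950 and 9 bytes in UTF-8, while the str in the middle has no byte encoding of its own. BeautifulSoup accepts either a str or bytes, so passing the decoded str directly would probably work as well.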

This exercise uses range heavily:

for h in range(0, 9):
    print(h)
output:
0
1
2
3
4
5
6
7
8

range is one of the most commonly used names in Python.
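One detail worth spelling out is that range excludes its stop value, which is why the scraper uses range(1, 9) to cover months 1 to 8. A short sketch that builds the year-month keys and URLs the same way the main loop does (the list names are mine):

```python
base = 'http://mops.twse.com.tw/t21/sii/t21sc03_102_'

months = []
urls = []
for k in range(1, 9):  # the stop value 9 is excluded, so k runs from 1 to 8
    months.append('2013' + '%02d' % k)     # zero-padded month: 201301, 201302, ...
    urls.append(base + str(k) + '_0.html')  # the URL uses the unpadded month

print(months[0], months[-1])  # 201301 201308
print(len(urls))              # 8
print(urls[1])
```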

for row in c.execute("SELECT * FROM rvn WHERE cpy='3149' AND cls='光電業'"):
    print (row)
output:
('201201', '光電業', '3149', '正達國際', 626592.0, 642657.0, 404131.0, -2.49, 55.04, 626592.0, 404131.0, 55.04)
.........................................................
('201308', '光電業', '3149', '正達國際', 608760.0, 0.0, 704346.0, 0.0, -13.57, 7386695.0, 5104485.0, 44.7)

We now have 18 months of revenue.
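The scraper inserts every value as a string, yet the query returns floats. Here is a sketch of how that can happen, using an in-memory database. The schema is an assumption: only the column names cpy and cls appear in the query above, so the other column names and the REAL types are hypothetical. The single row reuses the 201308 values from the output above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway database for the demo
c = conn.cursor()

# Hypothetical schema: only `cpy` and `cls` are named in the article
c.execute('''CREATE TABLE rvn (
    ym TEXT, cls TEXT, cpy TEXT, name TEXT,
    rev REAL, rev_prev_month REAL, rev_last_year REAL,
    mom_pct REAL, yoy_pct REAL,
    cum_rev REAL, cum_rev_last_year REAL, cum_pct REAL)''')

# All values are strings, as they would be after get_text()
row = ('201308', '光電業', '3149', '正達國際',
       '608760', '0', '704346', '0', '-13.57',
       '7386695', '5104485', '44.7')
c.execute('INSERT INTO rvn VALUES (?,?,?,?,?, ?,?,?,?,?, ?,?)', row)
conn.commit()

# Placeholders work in SELECT too, which avoids building SQL by hand
result = c.execute(
    'SELECT ym, cpy, rev, yoy_pct FROM rvn WHERE cpy=? AND cls=?',
    ('3149', '光電業')).fetchall()
print(result)  # [('201308', '3149', 608760.0, -13.57)]
conn.close()
```

With a REAL column, sqlite3 converts numeric-looking text to a float on insert, so the commas have to be removed beforehand or the value would stay text.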

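**Summary:** Compared with Mr. iInfo's monthly revenue scraper (http://white5168.blogspot.tw/2012/08/blog-post_11.html#.Uj2pXN98pNC), which runs 275 lines (about 250 after removing declarations and print messages), the BeautifulSoup version is much shorter at under 50 lines. Its readability is not bad, it is sometimes more intuitive, and it is easier to maintain.
So although BeautifulSoup stands on the shoulders of giants, its designer clearly has a knack for making an API easy to use.
It is like the way jQuery's designer rewrapped JavaScript and made jQuery hugely popular: you can pick it up intuitively, and intuitiveness matters a lot. BeautifulSoup mostly comes down to using find_all and find in combination, which is enough to extract the data from a web page.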